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Abstract 

Background:VJe have investigated tlie role that the mutation rate and the structure of genetic variation at a locus play in 
determining whether a gene is involved in disease. We predict that the mutation rate and its genetic diversity should be 
higher in genes associated with disease, unless all genes that could cause disease have already been identified. 

Results: Consistent with our predictions we find that genes associated with Mendelian and complex disease are 
substantially longer than non-disease genes. However, we find that both Mendelian and complex disease genes are found 
in regions of the genome with relatively low mutation rates, as inferred from intron divergence between humans and 
chimpanzees, and they are predicted to have similar rates of non-synonymous mutation as other genes. Finally, we find that 
disease genes are in regions of significantly elevated genetic diversity, even when variation in the rate of mutation is 
controlled for. The effect is small nevertheless. 

Conclusions: Our results suggest that gene length contributes to whether a gene is associated with disease. However, the 
mutation rate and the genetic architecture of the locus appear to play only a minor role in determining whether a gene is 
associated with disease. 

Citation: Eyre-Walker YC, Eyre-Walker A (2014) The Role of Mutation Rate Variation and Genetic Diversity in the Architecture of Human Disease. PLoS ONE 9(2): 
690166. doi:l 0.1 371/journal.pone.00901 66 

Editor: I. King Jordan, Georgia Institute of Technology, United States of America 
Received August 23, 2013; Accepted January 28, 2014; Published February 27, 2014 

Copyright: © 2014 Eyre-Walker, Eyre-Walker. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which 
permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 

Funding: These authors have no support or funding to report. 

Competing interests: The authors have declared that no competing interests exist. 

* E-mail: a.c.eyre-walker@sussex.ac.uk 



Introduction 

Why do humans sufTer from the diseases that we do? In part this 
is clearly due to our anatomy and physiology, and that of the 
organisms that infect us - we cannot liave a disease of an organ 
that we do not possess. But why do we suffer from cystic fibrosis 
rather than some other disease of the lungs? One simple reason 
might be variation in the mutation rate. Those genes and genomic 
regions that have high mutation rates are more likely to generate 
disease mutations, and hence be associated with a disease. The 
rate of mutation of a locus will depend upon two factors: the rate 
of mutation per site and the number of sites at which a mutation 
can generate a disease phenotype. The per site mutation rate is 
known to vary across the human genome at a number of different 
scales such that some genes have mutation rates that are several 
fold higher than other genes (reviewed in Hodgkinson et al. [1]). 
Genes also vary considerably in their length, with some of the 
largest, such as the dystrophin gene, being association with disease. 

A more subtie factor affecting the likelihood of a gene being 
associated with a disease is the genealogy. At each site in the 
genome there is an underlying genealogy whereby every 
chromosome in the population is related via a bifurcating tree to 
every other chromosome at that site. If there is no recombination 
between sites then sites share the same genealogy. The shape and 
depth of the genealogy depends on several factors. The first is 
chance; for example, the average total length of a genealogy for a 
neutral locus in a population of stationary size is expected to be 



proportional to 4jV generations in a diploid species, where jVis the 
population size, but this is expected to have a variance of at least 
(4jV)^ generations [2]. Second, the genealogy depends on the 
efiFective population size of the locus (jV,). jV^ is thought to vary 
across the human genome as a consequence of natural selection 
[3,4] . Selection can reduce the N,, of a genomic region through 
either a selective sweep caused by the passage of an advantageous 
mutation through the population [5] , or via background selection 
caused by the removal of deleterious mutations [6] . Those regions 
of the genome with low rates of recombination or a high density of 
selected sites are expected to have low jV„ and this is expected to 
reduce the genetic diversity of neutral and weakly selected variants 
in these regions (reviewed in [7]). Analyses suggest that Ne varies 
across the human genome by a few-fold [4]. The effective 
population size is not expected to affect the frequency of 
deleterious mutations in which the product of jVj and the strength 
selection is greater than one. However, stochastic factors affecting 
the genealogy are expected to be important irrespective of the 
selection acting upon a mutation. 

Previous analyses have shown that Mendelian disease genes are 
30% longer than non-disease genes [8,9]. Comparative analyses 
have also shown that genes associated with Mendehan diseases 
have significandy, but only slightly higher rates of mutation per 
site, as inferred from levels of synonymous divergence between 
species [8,10]. The rather modest differences between disease and 
non-diseases genes in the inferred mutation rate might be due to 
time frame over which the mutation rate was inferred: Smith and 
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Figure 1 . CDS length. (A) IVlean total CDS length, and (B) Mean average CDS length. Total CDS length is the sum of all constitutive and alternately 
spliced exons; average CDS length is the average CDS length of each transcript. Error bars represent the 95% confidence intervals. 
doi:1 0.1 371 /journal.pone.00901 66.g001 



Eyre-Walker [8] considered the divergence between human and 
mouse, and Huang et al. [10] considered the divergence between 
mouse and rat. This will give a poor estimate of the current 
mutation rate at a locus in humans because the relative mutation 
rate of a locus appears to have evolved through time [1,11]. The 
mutation rate has also recently been predicted, based on a model 
fitted to the locations of de novo mutations in humans, to be slightly 
higher in disease associated genes [12], but the accuracy of this 
model is unproven, and they consider the total mutation rate of the 
exon, rather than the rate at non-synonymous sites. Here we 



consider the divergence between humans and their most closely 
related extant relative, chimpanzee, as our measure of the 
mutation rate. We also consider whether the density of single 
nucleotide polymorphism (SNP) is greater in disease than non- 
disease genes. 

Materials and Methods 

To estimate mutation rate for each gene, we estimated their 
intron divergence between the human and chimpanzee genomes 
as follows. Alignments using the NCBI build 36 version of the 
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Figure 2. Mutation rates. The mutation rate per site, as inferred from intron divergence between Inuman and chimpanzee. A) Intron divergence 
per site between human and chimpanzee; B) the predicted non-synonymous mutation rate per CDS site. Error bars represent the 95% confidence 
intervals. 

doi:1 0.1 371/journal.pone.00901 66.g002 



human genome (hg 1 8) and PanTro2 version of the chimp genome 
were downloaded from the UCSC website (http:/ /genome.ucsc. 
edu/). Alignments were parsed into individual genie sequences 
and realigned with MAFFT version 6 (http:/ /mafft.cbrc.jp/ 
alignment/ software/). Exon sequences were masked according 



to exon annotation of the NCBI build 36 version of the human 
genome from the ensemble database (http://www.ensembl.org/). 
We did not correct for multiple hits; this is not necessary since the 
average intron divergence between human and chimpanzee 
sequences is 1.05% [13]. We calculated the rates of intron 
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divergence for CpG and nonCpG sites separately since the former 
have much higher rates of mutation. We used these intron 
divergences to infer tlie rate of non-synonymous mutation in 
human exons, by calculating the number of CpG and non-CpG 
sites in each exon which when mutated would give a non- 
synonymous change; in this calculation we assumed that all 
mutations at CpGs are transitions, which is a good approximation 
[14], and that 60% of mutations at other sites were transitions. If a 
gene had multiple transcripts we made these calculations for each 
transcript and averaged the result. 

DNA sequence diversity data were taken from the 1 000 genome 
project [15]. 

Genes were designated as being associated with Mendelian 
disease based upon the compilation made by Blekhman et al. [16]. 
Genes associated with genome-wide association studies (GWAS) 
were obtained from the GWAS catalog (http://www.genome.gov/ 
gwastudies/); a gene in which the strongest GWAS signal was 
found within the boundaries of a gene were designated as being a 
GWAS gene. 

To investigate what factors might influence patterns of genie 
mutations, estimated by intron divergence, we considered a 
number of variables. Intron GC content, nucleosome occupancy, 
replication timing and male and female recombination rates were 
downloaded from the UCSC website (http://genome.ucsc.edu/). 
We used A365 values to study the influence of nucleosome 
occupancy on the distribution of genie mutations rate across the 
genome. Recombination rates per MB were from Kong et al [17]. 
Replication time data were from Chen et al. [18] and Hansen et al. 
[19]. Qualitatively similar results were obtained using each of four 
replication time datasets, so we only present the analysis using data 
from an embryonic stem cell line BG02 [19]. Germ-line expression 
data were from a study by McVicker and Green [20]. 

The dataset for this analysis is available as Table S 1 . 

Results 

We predict that unless all possible diseases with a genetic basis, 
and all the genes that can cause them, have already been 
discovered, then genes associated with diseases should have higher 
genie mutation rates than non-disease genes, where the genie 
mutation rate is determined by the product of gene length and the 
mutation rate per site. We also predict that disease genes should be 
in relatively diverse regions of the genome. To investigate these 
predictions we compiled data from 17577 nuclear genes with 
introns, of which 854 genes are known to cause a Mendelian 



disease. We also analysed 1732 genes in which the strongest signal 
in a genomic region in a genome wide association study (GWAS) 
lay within the boundaries of the gene (i.e. all exons and introns 
between the start and stop codon). The presence of an association 
signal within the boundaries of the gene does not necessarily mean 
that the causative mutation is within the protein coding sequence 
or even within the boundaries of the gene, and many of these 
associations may be in regulatory sequences [21]. We subsequentiy 
excluded genes on the sex chromosomes since the Y-chromosome 
is known to have a higher mutation rate and the X-chromosome a 
lower mutation rate than the autosomes [22]. This yielded a 
dataset of 1 7062 autosomal genes including 820 associated with a 
Mendelian disease and 1 726 with a GWAS signal. Details of the 
dataset are given in Table SI. 

Gene Length 

Consistent with the hypothesis that disease genes should have 
higher overall rates of mutation we find, as others have in the past 
for genes causing Mendelian disease [8,9], that genes associated 
with disease are significandy longer, in terms of their total coding 
sequence (CDS) length (i.e. the sum of all constitutive and 
alternatively spliced exons), than non-disease genes; Mendelian 
disease genes are ~28% and GWAS genes ~44% longer than 
non-disease genes (One-way ANOVA p<0.001) (Figure la). A 
similar pattern is evident for average CDS length; both Mendelian 
and GWAS disease genes are 50% longer than non-disease genes 
(Figure lb). The difference in average CDS length is greater than 
in previous studies [8,9], but this is likely to be due to the 
improvement in genome annotation; the average length of genes is 
slightly shorter than in previous analyses. 

Strikingly, the diflference in length is as great or greater for the 
GWAS than the Mendelian disease genes despite the fact that 
many of the GWAS signals are likely to be outside the protein 
coding sequence [21]. GWAS genes might have longer CDSs for 
three reasons. First, genes with longer CDSs have a greater chance 
of generating a disease mutation. Second, longer genes are more 
likely to have a non-causative marker SNP in the CDS that is 
associated with the disease. And finally since intron and total CDS 
lengths are correlated (r= 0.36, p<0.001), genes with long CDSs 
have longer introns and hence an increase chance of having 
causative or non-causative SNPs in their introns. However, if we 
control for the correlation between intron and CDS length by 
regressing CDS length against intron length and taking the 
residuals, we find that GWAS genes have longer CDSs, than non- 
disease genes, even given their longer introns (t-test p<0.001; 



Table 1. Standardised regression coefficients from multiple regressions. 



Predicted non- 
synonymous mutation 

Factor intron Divergence rate Intron SNP density Average genealogy lengtli 



GC content 


0.525*** 


0.325*** 


0.1 92*** 


-0.212*** 


Nucleosome occupancy 


-0.396*** 


-0.1 67*** 


-0.41 2*** 


-0.035 


Female recombination rate 


-0.020* 


0.018* 


0.058*** 


0.042*** 


Male recombination rate 


0.202*** 


0.143*** 


0.129*** 


-0.048*** 


Germ-line expression 


-0.062*** 


-0.116*** 


-0.020* 


0.032*** 


Replication time 


-0.132*** 


-0.157*** 


-0.071*** 


0.038*** 


Distance to telomere 


-0.158*** 


-0.097*** 


-0.117*** 


0.060*** 


Distance to centromere 


-0.018* 


-0.016 


0.020* 


0.014 



Note that the replication time data is such that a negative slope indicates an increase in the variable through the cell cycle * p<0.05, ** p<0.01 and *** p<0.001. 
doi:l 0.1 371/journal.pone.0090166.t001 



PLOS ONE I www.plosone.org 



4 



February 2014 | Volume 9 | Issue 2 | e90166 



Mutation Rate and Disease 



A) 

0.012 - 




Non-disease IVIendelian GWAS 

B) 



0) 

.y 1.200 - 




Non-disease IVIendelian GWAS 



C) 



0.016 - 




Non-disease Mendelian GWAS 



Figure 3. Diversity and genealogy estimates. The diversity in 
disease and non-disease genes measured as the A) average intron SNP 
density, B) the average intron SNP density divided by intron divergence, 
C) and the mean minor allele frequency (MAF). Error bars represent the 
95% confidence intervals. 
doi:1 0.1 371 /journal.pone.00901 66.g003 

similar results are obtained if we regress log CDS length against 



log intron length (p<0.001)). This suggests that GWAS genes are 
not simply longer because they have longer introns; it therefore 
seems that either GWAS genes are more likely to be associated 
with disease because some causative mutations are within their 
exons, or because there is a greater number of marker SNPs in 
exons. 

Mutation Rates 

However, contrary to our expectations, we find that disease 
genes are found in regions of the genome with significantly lower 
per site mutation rates, as measured by intron divergence between 
human and chimpanzee. The difference is highly significant (one- 
way ANOVA p<0.001), but the difference is small with disease 
genes having approximately 5% lower intron divergence than 
non-disease genes (Figure 2a). The pattern differs between CpG 
and non-CpG sites, with disease genes having lower divergence at 
CpG sites and either similar or higher divergence at non-CpG sites 
(results not shown). If we calculate the expected non-synonymous 
mutation rate in the CDS by multiplying the proportion of non- 
synonymous sites that are CpG and non-CpG in the CDS by the 
respective levels of intron divergence, we still find that both 
Mendelian and complex disease genes have slightly lower 
mutation rates per site than non-disease genes (p = 0.004) 
(Figure 2b). As expected, both MendeKan and complex disease 
genes have significantly higher overall predicted rates of non- 
synonymous mutation (p<0.001), driven by the fact that disease 
genes have longer CDSs. 

The fact that disease genes have lower predicted rates of non- 
synonymous mutation per site is inconsistent with our hypothesis, 
but this might be due to the fact that they have features which 
predispose them to lower mutation rates - for example they might 
be transcribed at lower levels and hence have lower rates of 
mutation [23]. Divergence at intronic and intergenic sites is known 
to be significantly correlated to a number of other variables 
including GC-content [3,18,24,25], recombination rate 
[3,24,26,27,28], replication time [18,29,30], distance to the 
telomere and centromere [3,13,18,24], gene density [3,24], 
nucleosome occupancy [1 1] and expression level [23]. We confirm 
previous results and show that intron divergence is positive 
correlated to GC content and male recombination rate within a 
multiple regression; and that intron divergence is negatively 
correlated to replication time (later genes have higher divergence), 
distance to the telomere, distance to the centromere, female 
recombination rate, nucleosome occupancy and germ-line expres- 
sion (Table 1). Similar patterns are evident for the predicted non- 
synonymous mutation rate (Table 1). If we take the residuals from 
a multiple regression of intron divergence against all the genomic 
variables above we find that intron divergence and the predicted 
rate of non-synonymous mutation do not differ significandy 
between disease and non-disease genes. 

Genetic Diversity 

Although, disease genes are found in regions of the genome with 
relatively low rates of intron mutation we find that disease genes 
have a significantly greater density of polymorphisms segregating 
in their introns than non-disease genes; the difference is 1 1 % and 
17% for the Mendelian and GWAS genes respectively (Figure 3a). 
If we divide the density of SNPs by the divergence of introns to 
estimate a quantity that is proportional to the average length of the 
genealogies at the locus, we find that Mendelian and GWAS genes 
have significandy longer average genealogy lengths that are 9% 
and 12% greater than non-disease genes (ANOVA p<0.001; t-test 
of Mendelian versus non-disease p<0.001; t-test of GWAS versus 
non-disease p<0.001). It is odd that the difference between disease 
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and non-disease genes is less pronounced for average genealogy 
length than diversity given that disease genes have lower intron 
divergence than non-disease genes. This is probably due to non- 
linearities associated with ratios. 

Although, we find that disease genes have higher diversities and 
a\erage genealogy lengths than non-disease genes, we find no 
evidence that the predicted non-synonymous population mutation 
rate in the CDS (calculated as the proportion of non-synonymous 
sites that are CpG multiplied by the SNP density at CpG sites in 
introns plus the proportion of non-synonymous sites that are non- 
CpG multiplied by the SNP density at non-CpG in introns) differs 
between disease and non-disease genes. However, the calculation 
of the predicted non-synonymous population mutation rate is 
subject to considerable error because we have relatively few intron 
CpG sites and SNP density is very low in humans. 

It is possible that disease genes have higher diversities and 
average genealogy lengths because disease genes have features that 
predispose them to higher values, not because by having higher 
values they are more likely to be associated with disease. We find 
that intron SNP density is positively correlated to GC content, 
female and male rates of recombination and distance to the 
centromere and negatively correlated to the time of replication 
(late genes have higher diversity), nucleosome occupancy, germ- 
line expression and distance to the telomere (Table 1). If control 
for these factors by taking the residuals from the multiple 
regression we find that SNP density is stiU significantly greater in 
both Mendelian and GWAS genes, than in non-disease genes 
(ANOVA p<0.001; individual t-tests p<0.001). Likewise we find 
the average genealogy length is positively correlated to all variables 
except GC content, nucleosome occupancy and male recombina- 
tion rate (Table 1), and that after controlling for these associations, 
disease genes still have significandy greater average genealogy 
lengths than non disease genes (ANOVA p = 0.019; individual t- 
tests Mendelian versus non-disease p = 0.21, GWAS versus non- 
disease p = 0.001). 

Although disease genes have a greater number of SNPs per bp 
than non-disease genes the distribution of the genetic variation 
varies in an inconsistent manner between categories of genes; the 
average minor allele frequency is ~10% greater in Mendelian, 
and ~10% lower in GWAS genes, than in non-disease genes 
(ANOVA p<0.01) (Figure 3b). 

Discussion 

We have found that genes associated with disease are longer and 
reside in regions of the genome with greater intron diversities and 
average genealogy lengths than non-disease genes. This is 
consistent with a role for mutation and genetic variation in 
determining whether a gene becomes associated with disease. 
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